The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this, as it is a very historic moment, both scientifically and analytically (we have a large amount of data being shared across the globe and analyzed in real time). The world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by the source. I realized early on that, depending on the source, there were conflicting numbers of cases. Originally I was using JHU data… but this was always ‘ahead’ of Our World In Data. I also noticed that JHU’s website was buggy: you clicked on the U.S. stats but it didn’t reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots I can compare my own to). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in quantities like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Tue Jun 9 20:57:33 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the ‘analysis’ but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Tue Jun  9 20:57:33 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name      description
prediction_model   outputs case estimates for given log-linear model parameters (slope and intercept)
make_long          converts input data to long format (specialized cases)
name_overlaps      outputs the column-name intersection and set differences of two data frames
find_linear_index  finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  total_cases.log<-m*days+b        # model was fit on log10(cases)
  total_cases<-10^total_cases.log  # invert the log10 transform to recover case counts
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------
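As a quick sanity check of the log-linear idea (toy parameters, not fitted values): with slope m and intercept b on log10 cases, the predicted case count is 10^(m*days + b).

```r
# Hypothetical parameters: log10(cases) = m*days + b
m <- 0.1   # log10 cases gained per day
b <- 2     # log10 cases at day 0 (i.e., 100 cases)
days <- c(0, 10, 20)
predicted <- 10^(m * days + b)  # invert the log10 transform
predicted  # 100 1000 10000
```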

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as variable/values. Also enforces standardization/assumes data structure naming by using fixed variable name, value name, and id.vars.
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                   value.name = "Total_confirmed_cases",
                   id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){

long_data<-melt(data_in,
                id.vars = id.vars,
                variable.name=variable.name,
                value.name=value.name)
return(long_data)

}
##------------------------------------------
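To illustrate what the melt step does, here is a toy wide-format frame (made-up values, and fewer id columns than the real data):

```r
library(reshape2)

# Toy wide-format frame: one row per region, one column per date
wide <- data.frame(Country.Region = "Exampleland",
                   X3.01.20 = 5, X3.02.20 = 8)
long <- melt(wide, id.vars = "Country.Region",
             variable.name = "Date", value.name = "Total_confirmed_cases")
long
#   Country.Region     Date Total_confirmed_cases
# 1    Exampleland X3.01.20                     5
# 2    Exampleland X3.02.20                     8
```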

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
  i<-intersect(names(df1),names(df2))
  sd1<-setdiff(names(df1),names(df2))
  sd2<-setdiff(names(df2),names(df1))
  cat("intersection:\n",paste(i,"\n"))
  cat("in df1 but not df2:\n",paste(sd1,"\n"))
  cat("in df2 but not df1:\n",paste(sd2,"\n"))
  return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}
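The building blocks here are just base-R set operations on column names; for example:

```r
df1 <- data.frame(a = 1, b = 2)
df2 <- data.frame(b = 3, c = 4)
intersect(names(df1), names(df2))  # "b"
setdiff(names(df1), names(df2))    # "a"  (in df1 but not df2)
setdiff(names(df2), names(df1))    # "c"  (in df2 but not df1)
```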

##------------------------------------------

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count how many of the last `running_avg` second-derivative values are within 0.2 of zero
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[t:(t-(running_avg-1)),"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  # Step back `running_avg` days from the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
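The intuition behind the second-derivative test, on toy data: perfectly exponential counts are linear in log space, so their second differences are (near) zero and the series is "stable" from the start.

```r
cases <- 100 * 2^(0:9)           # toy data: cases double every day
log_cases <- log(cases, 2)       # linear in log2 space
first_diff <- diff(log_cases)    # constant slope of 1 per day
second_diff <- diff(first_diff)  # all (near) zero
all(abs(second_diff) < 0.2)      # TRUE: "linear" by the 0.2 criterion
```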

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.8.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.8.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.8.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.8.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# Population is listed in the US deaths file but not in the other datasets... Weird. When combining, all datasets will have this column, but US deaths is the only one with meaningful values.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # US-level data has Cities
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
# JHU has a gross file format: wide, with one column per date in MM/DD/YY. So read this in as raw data but transform it to be better suited for analysis.
# Furthermore, the world- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "0") # R prefixes numeric column names with "X"; swap in a leading 0
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # year ends in .20, not .2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%Y") # four-digit year after the substitution above
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
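For example, the raw JHU column name "X6.8.20" is cleaned as follows (using a four-digit-year format after the substitutions):

```r
d <- "X6.8.20"                   # column name as read in by R
d <- gsub("^X", "0", d)          # "06.8.20"
d <- gsub("20$", "2020", d)      # "06.8.2020"
as.Date(d, format = "%m.%d.%Y")  # "2020-06-08"
```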

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
death total
Afghanistan 139 139
Albania 139 139
Algeria 139 139
Andorra 139 139
Angola 139 139
Antigua and Barbuda 139 139
Argentina 139 139
Armenia 139 139
Australia 1112 1112
Austria 139 139
Azerbaijan 139 139
Bahamas 139 139
Bahrain 139 139
Bangladesh 139 139
Barbados 139 139
Belarus 139 139
Belgium 139 139
Belize 139 139
Benin 139 139
Bhutan 139 139
Bolivia 139 139
Bosnia and Herzegovina 139 139
Botswana 139 139
Brazil 139 139
Brunei 139 139
Bulgaria 139 139
Burkina Faso 139 139
Burma 139 139
Burundi 139 139
Cabo Verde 139 139
Cambodia 139 139
Cameroon 139 139
Canada 1946 1946
Central African Republic 139 139
Chad 139 139
Chile 139 139
China 4587 4587
Colombia 139 139
Comoros 139 139
Congo (Brazzaville) 139 139
Congo (Kinshasa) 139 139
Costa Rica 139 139
Cote d’Ivoire 139 139
Croatia 139 139
Cuba 139 139
Cyprus 139 139
Czechia 139 139
Denmark 417 417
Diamond Princess 139 139
Djibouti 139 139
Dominica 139 139
Dominican Republic 139 139
Ecuador 139 139
Egypt 139 139
El Salvador 139 139
Equatorial Guinea 139 139
Eritrea 139 139
Estonia 139 139
Eswatini 139 139
Ethiopia 139 139
Fiji 139 139
Finland 139 139
France 1529 1529
Gabon 139 139
Gambia 139 139
Georgia 139 139
Germany 139 139
Ghana 139 139
Greece 139 139
Grenada 139 139
Guatemala 139 139
Guinea 139 139
Guinea-Bissau 139 139
Guyana 139 139
Haiti 139 139
Holy See 139 139
Honduras 139 139
Hungary 139 139
Iceland 139 139
India 139 139
Indonesia 139 139
Iran 139 139
Iraq 139 139
Ireland 139 139
Israel 139 139
Italy 139 139
Jamaica 139 139
Japan 139 139
Jordan 139 139
Kazakhstan 139 139
Kenya 139 139
Korea, South 139 139
Kosovo 139 139
Kuwait 139 139
Kyrgyzstan 139 139
Laos 139 139
Latvia 139 139
Lebanon 139 139
Lesotho 139 139
Liberia 139 139
Libya 139 139
Liechtenstein 139 139
Lithuania 139 139
Luxembourg 139 139
Madagascar 139 139
Malawi 139 139
Malaysia 139 139
Maldives 139 139
Mali 139 139
Malta 139 139
Mauritania 139 139
Mauritius 139 139
Mexico 139 139
Moldova 139 139
Monaco 139 139
Mongolia 139 139
Montenegro 139 139
Morocco 139 139
Mozambique 139 139
MS Zaandam 139 139
Namibia 139 139
Nepal 139 139
Netherlands 695 695
New Zealand 139 139
Nicaragua 139 139
Niger 139 139
Nigeria 139 139
North Macedonia 139 139
Norway 139 139
Oman 139 139
Pakistan 139 139
Panama 139 139
Papua New Guinea 139 139
Paraguay 139 139
Peru 139 139
Philippines 139 139
Poland 139 139
Portugal 139 139
Qatar 139 139
Romania 139 139
Russia 139 139
Rwanda 139 139
Saint Kitts and Nevis 139 139
Saint Lucia 139 139
Saint Vincent and the Grenadines 139 139
San Marino 139 139
Sao Tome and Principe 139 139
Saudi Arabia 139 139
Senegal 139 139
Serbia 139 139
Seychelles 139 139
Sierra Leone 139 139
Singapore 139 139
Slovakia 139 139
Slovenia 139 139
Somalia 139 139
South Africa 139 139
South Sudan 139 139
Spain 139 139
Sri Lanka 139 139
Sudan 139 139
Suriname 139 139
Sweden 139 139
Switzerland 139 139
Syria 139 139
Taiwan* 139 139
Tajikistan 139 139
Tanzania 139 139
Thailand 139 139
Timor-Leste 139 139
Togo 139 139
Trinidad and Tobago 139 139
Tunisia 139 139
Turkey 139 139
Uganda 139 139
Ukraine 139 139
United Arab Emirates 139 139
United Kingdom 1529 1529
Uruguay 139 139
US 139 139
US_state 453279 453279
Uzbekistan 139 139
Venezuela 139 139
Vietnam 139 139
West Bank and Gaza 139 139
Western Sahara 139 139
Yemen 139 139
Zambia 139 139
Zimbabwe 139 139
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some countries are not summarized on the country level. Collapse all but the US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

# Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case occurred for each Country/Province. NOTE: non-US countries may have irregular provinces. For example, France is summarized at the country level but also has 3 provinces. I've only ensured the U.S. case100 works... so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider a city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
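The same computation on a single toy region, in base R: find the first date with more than 100 cases, then express every date relative to it.

```r
toy <- data.frame(Country.Region = "A",
                  Date.numeric = 1:5,
                  Total_confirmed_cases = c(40, 90, 150, 300, 700))
case100_date <- min(toy$Date.numeric[toy$Total_confirmed_cases > 100])
toy$Days_since_100 <- toy$Date.numeric - case100_date
toy$Days_since_100  # -2 -1 0 1 2
```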

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 5145
Alaska 1000
Arizona 1271
Arkansas 5430
California 4845
Colorado 4667
Connecticut 754
Delaware 315
Diamond Princess 84
District of Columbia 85
Florida 5450
Georgia 12264
Grand Princess 85
Guam 85
Hawaii 437
Idaho 2479
Illinois 7145
Indiana 7113
Iowa 6680
Kansas 5603
Kentucky 8026
Louisiana 5125
Maine 1304
Maryland 1985
Massachusetts 1267
Michigan 6250
Minnesota 5960
Mississippi 6358
Missouri 7165
Montana 2217
Nebraska 4271
Nevada 996
New Hampshire 895
New Jersey 1905
New Mexico 2196
New York 4845
North Carolina 7536
North Dakota 2651
Northern Mariana Islands 70
Ohio 6709
Oklahoma 5156
Oregon 2583
Pennsylvania 5271
Puerto Rico 85
Rhode Island 502
South Carolina 3697
South Dakota 3364
Tennessee 7271
Texas 15569
Utah 1162
Vermont 1200
Virgin Islands 85
Virginia 9542
Washington 3307
West Virginia 3543
Wisconsin 5138
Wyoming 1596
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time. For each geographical set:
  • comparative longitudinal case trend (absolute & log scale)
  • comparative longitudinal mortality trend
  • death vs total correlation

question | dataset | x | y | color | facet | pch | dimensions
comparative longitudinal case trend (log) | long | time | log_cases | geography | none (case type?) | case_type | [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend | long | time | cases | geography | case_type | ? | [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend | wide | time | mortality rate | geography | none | none | [15, 50, 4] geography
death vs total correlation | wide | cases | deaths | geography | none | none | [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in cases/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in cases/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? -> apply lm to each region starting from the 100th case; m, b associated with each.
    # input: geographical region, log cases vs days (since 100th case)
    # output: m, b for each geographical region ID
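A sketch of that per-region fit on toy data (made-up, exactly log-linear growth; column names match the ones used elsewhere in this analysis):

```r
library(plyr)

# Toy data: two regions with exactly log-linear growth
toy <- data.frame(Country.Region = rep(c("A", "B"), each = 5),
                  Days_since_100 = rep(0:4, 2),
                  Total_confirmed_cases.log = c(2 + 0.3 * (0:4),
                                                2 + 0.1 * (0:4)))

# Fit log10(cases) ~ days per region and collect intercept b and slope m
fits <- ddply(toy, "Country.Region", function(df) {
  fit <- lm(Total_confirmed_cases.log ~ Days_since_100, data = df)
  data.frame(b = unname(coef(fit)[1]), m = unname(coef(fit)[2]))
})
fits
#   Country.Region b   m
# 1              A 2 0.3
# 2              B 2 0.1
```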



# total/death on the same plot differ by ~2 logs, so when plotting log scale, use pch; when plotting absolute counts, need to use free scales
# when plotting deaths and cases on the same plot, melt first

#CoronaCases -> filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 52764
China 139
Diamond Princess 120
Korea, South 110
Japan 109
Italy 107
Iran 104
Singapore 101
France 100
Germany 100
Spain 99
US 98
Switzerland 96
United Kingdom 96
Belgium 95
Netherlands 95
Norway 95
Sweden 95
Austria 93
Malaysia 92
Australia 91
Bahrain 91
Denmark 91
Canada 90
Qatar 90
Iceland 89
Brazil 88
Czechia 88
Finland 88
Greece 88
Iraq 88
Israel 88
Portugal 88
Slovenia 88
Egypt 87
Estonia 87
India 87
Ireland 87
Kuwait 87
Philippines 87
Poland 87
Romania 87
Saudi Arabia 87
Indonesia 86
Lebanon 86
San Marino 86
Thailand 86
Chile 85
Pakistan 85
Luxembourg 84
Peru 84
Russia 84
Ecuador 83
Mexico 83
Slovakia 83
South Africa 83
United Arab Emirates 83
Armenia 82
Colombia 82
Croatia 82
Panama 82
Serbia 82
Taiwan* 82
Turkey 82
Argentina 81
Bulgaria 81
Latvia 81
Uruguay 81
Algeria 80
Costa Rica 80
Dominican Republic 80
Hungary 80
Andorra 79
Bosnia and Herzegovina 79
Jordan 79
Lithuania 79
Morocco 79
New Zealand 79
North Macedonia 79
Vietnam 79
Albania 78
Cyprus 78
Malta 78
Moldova 78
Brunei 77
Burkina Faso 77
Sri Lanka 77
Tunisia 77
Ukraine 76
Azerbaijan 75
Ghana 75
Kazakhstan 75
Oman 75
Senegal 75
Venezuela 75
Afghanistan 74
Cote d’Ivoire 74
Cuba 73
Mauritius 73
Uzbekistan 73
Cambodia 72
Cameroon 72
Honduras 72
Nigeria 72
West Bank and Gaza 72
Belarus 71
Georgia 71
Bolivia 70
Kosovo 70
Kyrgyzstan 70
Montenegro 70
Congo (Kinshasa) 69
Kenya 68
Niger 67
Guinea 66
Rwanda 66
Trinidad and Tobago 66
Paraguay 65
Bangladesh 64
Djibouti 62
El Salvador 61
Guatemala 60
Madagascar 59
Mali 58
Congo (Brazzaville) 55
Jamaica 55
Gabon 53
Somalia 53
Tanzania 53
Ethiopia 52
Burma 51
Sudan 50
Liberia 49
Maldives 47
Equatorial Guinea 46
Cabo Verde 44
Sierra Leone 42
Guinea-Bissau 41
Togo 41
Zambia 40
Eswatini 39
Chad 38
Tajikistan 37
Haiti 35
Sao Tome and Principe 35
Benin 33
Nepal 33
Uganda 33
Central African Republic 32
South Sudan 32
Guyana 30
Mozambique 29
Yemen 25
Mongolia 24
Mauritania 21
Nicaragua 21
Malawi 15
Syria 15
Zimbabwe 13
Bahamas 12
Libya 12
Comoros 10
Suriname 2
# Pick top 15 countries with data
max_colors<-12
# find a way to fix this - China has different provinces, so the plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 139
Korea, South 110
Japan 109
Italy 107
Iran 104
Singapore 101
France 100
Germany 100
Spain 99
US 98
Switzerland 96
United Kingdom 96
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-05-18        18400
## 2    Afghanistan           <NA> <NA> 2020-05-16        18398
## 3    Afghanistan           <NA> <NA> 2020-04-06        18358
## 4    Afghanistan           <NA> <NA> 2020-01-26        18287
## 5    Afghanistan           <NA> <NA> 2020-01-27        18288
## 6    Afghanistan           <NA> <NA> 2020-01-25        18286
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                    173                  7072     0.02446267
## 2                    168                  6402     0.02624180
## 3                     11                   367     0.02997275
## 4                      0                     0            NaN
## 5                      0                     0            NaN
## 6                      0                     0            NaN
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                  3.849542                   2.238046        18348
## 2                  3.806316                   2.225309        18348
## 3                  2.564666                   1.041393        18348
## 4                      -Inf                       -Inf        18348
## 5                      -Inf                       -Inf        18348
## 6                      -Inf                       -Inf        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1             52  NA   NA         NA                           NA
## 2             50  NA   NA         NA                           NA
## 3             10  NA   NA         NA                           NA
## 4            -61  NA   NA         NA                           NA
## 5            -60  NA   NA         NA                           NA
## 6            -62  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
    ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

  1. Find the time at which cases vs. date become linear in each plot
  2. Fit a linear model for each state
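The `find_linear_index` helper used below is defined upstream and not shown here. A minimal sketch of the idea, assuming "linear" means the discrete second derivative of the case curve is near zero (the `find_linear_index_sketch` name and the `tol` threshold are illustrative assumptions, not the actual upstream definition):

```r
# Hypothetical sketch of the find_linear_index idea: return the numeric date
# at which a state's case curve first looks linear, i.e. where the discrete
# second derivative drops near zero.
find_linear_index_sketch <- function(df, tol = 50) {
  df <- df[order(df$Date), ]
  y <- df$Total_confirmed_cases_perstate
  d2 <- diff(y, differences = 2)      # discrete second derivative
  idx <- which(abs(d2) < tol)[1]      # first near-flat curvature point
  if (is.na(idx)) idx <- 1            # fall back to the start of the series
  as.numeric(df$Date)[idx]            # numeric date, as consumed downstream
}
```

On a curve that doubles for a few days and then settles into constant daily increments, this returns the date where the increments stop changing.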

# What is the predicted # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually fix states so that Maryland is switched out for New York
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

# Initialize with zero rows so the first rbind does not leave a leading all-NA row
predicted_days_df<-data.frame(Province.State=character(),days=numeric(),Total_confirmed_cases_perstate=numeric())

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the prediction of COVID-19 cases based on the model thus far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long will it be until x number of cases? When will the plateau happen? Are any effects of social distancing noticeable, and with what delay?
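The `prediction_model` function used in this section is defined upstream and not shown here. As a hedged sketch of what this section assumes, given a log-linear fit log10(cases) = m * day + b, predictions back-transform with 10^ (the `prediction_model_sketch` name is an illustrative assumption):

```r
# Hypothetical sketch, not the upstream definition: project cases for a set
# of days from a fitted log10-linear model (log10(cases) = m*day + b).
prediction_model_sketch <- function(m, b, days) {
  data.frame(
    Days_since_100 = days,
    Total_confirmed_cases.log = m * days + b,   # fitted log10 cases
    Total_confirmed_cases = 10^(m * days + b)   # back-transformed count
  )
}
```

For example, with m = 0.1 and b = 2, day 10 projects 10^3 = 1000 cases.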

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
ddply(tmp,c("state"))
lm(data =tmp,formula = y~x )

train_lm<-function(input_data,subset_column,formula_input){
# Fit one lm per group, then collect the coefficients into a data frame
case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
case_models.parameters <- ldply(case_models, coef)
# coef() returns "(Intercept)" then the predictor; relabel as b (intercept) and m (slope)
names(case_models.parameters)[2:3] <- c("b","m")
return(case_models.parameters)
}

train_lm(tmp,"state",y~x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# acutal_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# acutal_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(acutal_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes the % change in user mobility relative to baseline for:

* Recreation
* Workplace
* Residence
* Park
* Grocery

Data from https://www.google.com/covid19/mobility/
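The mobility columns arrive as percent strings (e.g. "-45%") and are stripped to numerics below. A one-line helper capturing that cleanup (the `pct_to_num` name is an illustrative assumption; the code below applies the same `gsub` pattern inline):

```r
# Illustrative helper: convert Google mobility strings like "-45%" to numeric
pct_to_num <- function(x) as.numeric(gsub(x, pattern = "%", replacement = ""))
```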

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upstream, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases per 100 people (Today)"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases (Today) per 100 people"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit 1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Retail_Recreation 0.9985322 -56.0
Hawaii Grocery_Pharmacy 0.9970724 -34.0
New Hampshire Parks 0.9539163 -20.0
Vermont Parks 0.9177900 -35.5
Maine Transit -0.9151315 -50.0
Utah Residential -0.8924845 12.0
Connecticut Grocery_Pharmacy -0.8877723 -6.0
South Dakota Parks 0.8309094 -26.0
Utah Transit -0.8175079 -18.0
Hawaii Parks 0.7959027 -72.0
Wyoming Parks -0.7865584 -4.0
Rhode Island Workplace -0.7648509 -39.5
Arizona Grocery_Pharmacy -0.7643650 -15.0
Alaska Workplace -0.7604741 -33.0
Massachusetts Workplace -0.7528501 -39.0
Connecticut Transit -0.7497088 -50.0
Hawaii Transit 0.7434437 -89.0
Wyoming Transit -0.6832384 -17.0
Alaska Grocery_Pharmacy -0.6778509 -7.0
Hawaii Residential -0.6687986 19.0
Vermont Grocery_Pharmacy -0.6546694 -25.0
New York Workplace -0.6505779 -34.5
Alaska Residential 0.6471240 13.0
Utah Parks -0.6402874 17.0
Rhode Island Retail_Recreation -0.6264114 -45.0
Maine Workplace -0.6065808 -30.0
New Jersey Parks -0.6033903 -6.0
Rhode Island Residential -0.6030669 18.5
North Dakota Parks 0.6014647 -34.0
Utah Workplace -0.5949695 -37.0
Arizona Residential 0.5906843 13.0
New York Retail_Recreation -0.5885120 -46.0
Nebraska Workplace 0.5834833 -32.0
Arizona Retail_Recreation -0.5781424 -42.5
Nevada Transit -0.5743772 -20.0
New Jersey Workplace -0.5468135 -44.0
New York Parks 0.5291821 20.0
Idaho Residential -0.5215653 11.0
Connecticut Residential 0.5130787 14.0
Connecticut Retail_Recreation -0.5107765 -45.0
Massachusetts Retail_Recreation -0.4993158 -44.0
Maine Parks 0.4911095 -31.0
Kansas Parks 0.4909009 72.0
New Jersey Grocery_Pharmacy -0.4854687 2.5
Montana Parks -0.4808976 -58.0
New Mexico Grocery_Pharmacy -0.4781020 -11.0
Montana Residential 0.4776148 14.0
West Virginia Parks 0.4767017 -33.0
Connecticut Workplace -0.4745247 -39.0
Nebraska Residential -0.4695483 14.0
Iowa Parks -0.4644499 28.5
North Dakota Retail_Recreation -0.4528857 -42.0
Rhode Island Parks 0.4526207 52.0
New Mexico Residential 0.4502963 13.5
New Jersey Retail_Recreation -0.4403281 -62.5
Maryland Workplace -0.4337111 -35.0
Illinois Transit -0.4336684 -31.0
Pennsylvania Workplace -0.4249515 -36.0
New Mexico Parks 0.4222156 -31.5
South Carolina Workplace 0.4210104 -30.0
Wyoming Workplace -0.4192610 -31.0
New Jersey Transit -0.4088658 -50.5
Massachusetts Grocery_Pharmacy -0.4086248 -7.0
Arizona Transit 0.4057516 -38.0
New Hampshire Residential -0.4040777 14.0
Vermont Residential 0.4013805 11.5
Maryland Grocery_Pharmacy -0.3984897 -10.0
Alabama Grocery_Pharmacy -0.3973738 -2.0
Hawaii Workplace 0.3886062 -46.0
Pennsylvania Retail_Recreation -0.3830378 -45.0
Alabama Workplace -0.3728917 -29.0
New York Transit -0.3694369 -48.0
Kentucky Parks -0.3690965 28.5
Oregon Workplace -0.3601304 -31.0
Michigan Parks 0.3567905 30.0
Idaho Grocery_Pharmacy -0.3556557 -5.5
Wisconsin Transit -0.3549904 -23.5
New Mexico Retail_Recreation -0.3518068 -42.5
Idaho Workplace -0.3517754 -29.0
California Transit -0.3450494 -42.0
Nebraska Grocery_Pharmacy 0.3426015 -0.5
Oregon Parks -0.3334522 16.5
California Residential 0.3279013 14.0
Arkansas Parks 0.3273195 -12.0
Wyoming Grocery_Pharmacy -0.3225819 -10.0
North Dakota Workplace 0.3218716 -40.0
Virginia Transit -0.3187846 -33.0
Idaho Transit -0.3186793 -30.0
Arkansas Retail_Recreation -0.3140414 -30.0
Maryland Retail_Recreation -0.3137260 -39.0
California Parks -0.3032800 -38.5
Maine Retail_Recreation -0.3031767 -42.0
Alaska Retail_Recreation 0.2967517 -39.0
Minnesota Transit -0.2965898 -28.5
Illinois Workplace -0.2918265 -30.5
North Carolina Grocery_Pharmacy 0.2892533 0.0
Vermont Retail_Recreation 0.2851285 -57.0
Pennsylvania Parks 0.2841631 12.0
Montana Transit 0.2838779 -41.0
Colorado Residential 0.2829797 14.0
Nevada Residential 0.2815815 17.0
Mississippi Residential 0.2801259 13.0
Alabama Transit -0.2769608 -36.5
Florida Residential 0.2753286 14.0
North Carolina Workplace 0.2726502 -31.0
Missouri Residential -0.2646283 13.0
Texas Workplace 0.2637419 -32.0
Rhode Island Grocery_Pharmacy 0.2634588 -7.5
Georgia Grocery_Pharmacy -0.2599117 -10.0
Maryland Residential 0.2581497 15.0
Texas Residential -0.2566939 15.0
Rhode Island Transit -0.2528446 -56.0
Kansas Workplace 0.2520224 -32.0
Pennsylvania Grocery_Pharmacy -0.2504516 -6.0
Illinois Parks 0.2497361 26.5
Tennessee Workplace -0.2482152 -31.0
South Carolina Parks -0.2470717 -23.0
Vermont Workplace -0.2468536 -43.0
West Virginia Grocery_Pharmacy -0.2452513 -6.0
Florida Parks -0.2338670 -43.0
New York Grocery_Pharmacy -0.2303935 8.0
Tennessee Residential 0.2277258 11.5
Montana Workplace -0.2221924 -40.0
Nevada Retail_Recreation -0.2201506 -43.0
Arkansas Residential 0.2199386 12.0
Wisconsin Parks 0.2162503 51.5
South Dakota Workplace 0.2155835 -35.0
North Dakota Grocery_Pharmacy -0.2117667 -8.0
Georgia Workplace -0.2105708 -33.5
Idaho Retail_Recreation -0.2081612 -39.5
Illinois Residential 0.2076571 14.0
North Carolina Transit 0.2056615 -32.0
Utah Retail_Recreation -0.2047067 -40.0
North Carolina Residential 0.2038046 13.0
Washington Grocery_Pharmacy 0.2028195 -7.0
Georgia Retail_Recreation -0.2021523 -41.0
Minnesota Parks 0.1999528 -9.0
Michigan Workplace -0.1989224 -40.0
Kansas Grocery_Pharmacy -0.1957458 -14.0
Mississippi Grocery_Pharmacy -0.1944984 -8.0
West Virginia Workplace 0.1940842 -33.0
Oregon Residential 0.1900900 10.5
Colorado Parks -0.1885761 2.0
Oregon Transit 0.1867898 -27.5
Virginia Residential 0.1843451 14.0
Texas Parks 0.1804303 -42.0
Connecticut Parks 0.1799764 43.0
Wisconsin Residential -0.1767192 14.0
Arizona Parks -0.1717298 -44.5
Pennsylvania Transit -0.1692248 -42.0
New Jersey Residential 0.1681071 18.0
Kentucky Grocery_Pharmacy 0.1666028 4.0
New Mexico Transit 0.1634764 -38.5
Massachusetts Parks 0.1619406 39.0
Indiana Residential 0.1614146 12.0
New Hampshire Retail_Recreation -0.1592690 -41.0
Nevada Workplace -0.1586991 -40.0
Nebraska Transit -0.1585015 -9.0
Iowa Transit 0.1583406 -24.0
Kentucky Transit -0.1579182 -31.0
Ohio Transit 0.1565127 -28.0
California Grocery_Pharmacy -0.1551077 -11.5
Missouri Workplace 0.1541188 -28.0
Indiana Parks -0.1497791 29.0
Indiana Retail_Recreation 0.1471317 -38.0
South Dakota Residential 0.1447030 15.0
Virginia Grocery_Pharmacy -0.1432400 -8.0
Alabama Parks 0.1431795 -1.0
Florida Retail_Recreation 0.1415190 -43.0
California Retail_Recreation -0.1403881 -44.0
Mississippi Retail_Recreation -0.1358597 -40.0
Michigan Retail_Recreation -0.1355428 -53.0
Washington Workplace -0.1320207 -38.0
Mississippi Transit -0.1275834 -38.5
Georgia Residential -0.1275064 13.0
Minnesota Retail_Recreation 0.1273530 -40.0
Wisconsin Workplace -0.1268127 -31.0
North Dakota Transit 0.1250944 -48.0
South Dakota Transit -0.1250878 -40.0
Oregon Grocery_Pharmacy 0.1235586 -7.0
Missouri Transit -0.1226315 -24.5
Texas Grocery_Pharmacy 0.1212200 -14.0
North Dakota Residential -0.1134578 17.0
Virginia Parks 0.1120177 6.0
Wisconsin Grocery_Pharmacy 0.1120093 -1.0
Wyoming Residential 0.1116242 12.5
Massachusetts Transit -0.1080703 -45.0
Nebraska Retail_Recreation 0.1072497 -36.0
Kentucky Residential 0.1063769 12.0
Arkansas Workplace -0.1044910 -26.0
Tennessee Parks -0.1044038 10.5
Kansas Transit -0.1042429 -26.5
Alabama Retail_Recreation 0.1036057 -39.0
Nevada Parks 0.1024623 -12.5
New Hampshire Grocery_Pharmacy -0.1023689 -6.0
Idaho Parks 0.1018440 -22.0
North Carolina Parks -0.1017458 7.0
Ohio Parks -0.1002294 67.5
Texas Transit 0.0989403 -41.0
Oklahoma Grocery_Pharmacy -0.0987242 -0.5
North Carolina Retail_Recreation 0.0980882 -34.0
Massachusetts Residential 0.0977742 15.0
Maryland Transit -0.0975085 -39.0
Washington Residential 0.0971985 13.0
Ohio Residential 0.0954324 14.0
New York Residential 0.0918447 17.5
Missouri Parks 0.0899359 0.0
Minnesota Workplace -0.0887896 -33.0
Georgia Parks 0.0873620 -6.0
Indiana Workplace 0.0862082 -34.0
Virginia Workplace -0.0849390 -31.5
Iowa Workplace 0.0845126 -30.0
California Workplace -0.0845051 -36.0
Oklahoma Workplace 0.0844612 -31.0
Pennsylvania Residential 0.0829373 15.0
Michigan Residential 0.0811007 15.0
Virginia Retail_Recreation -0.0779358 -35.0
Utah Grocery_Pharmacy 0.0745459 -4.0
Maine Residential -0.0740686 11.0
Minnesota Residential -0.0737785 17.0
Michigan Transit 0.0736852 -46.0
Ohio Grocery_Pharmacy 0.0730034 0.0
Minnesota Grocery_Pharmacy 0.0724829 -6.0
Mississippi Parks -0.0716937 -25.0
Washington Transit -0.0713135 -33.5
Colorado Transit 0.0712346 -36.0
Florida Grocery_Pharmacy 0.0712078 -14.0
Iowa Grocery_Pharmacy -0.0700791 4.0
Kentucky Retail_Recreation 0.0696980 -29.0
Michigan Grocery_Pharmacy -0.0669978 -11.0
South Carolina Residential -0.0652675 12.0
Texas Retail_Recreation 0.0635768 -40.0
Oklahoma Residential 0.0635084 15.0
Oregon Retail_Recreation -0.0628074 -40.5
Alabama Residential -0.0613547 11.0
Indiana Grocery_Pharmacy -0.0596366 -5.5
Wyoming Retail_Recreation -0.0591307 -39.0
South Carolina Transit 0.0571351 -45.0
Washington Parks 0.0545960 -3.5
Mississippi Workplace -0.0537027 -33.0
Ohio Retail_Recreation 0.0532866 -36.0
West Virginia Residential -0.0527675 11.0
Oklahoma Parks 0.0527300 -18.5
West Virginia Transit -0.0523837 -45.0
South Carolina Grocery_Pharmacy 0.0519536 1.0
New Hampshire Workplace 0.0478305 -37.0
Indiana Transit 0.0456577 -29.0
Kentucky Workplace -0.0455900 -36.5
Illinois Grocery_Pharmacy -0.0429750 2.0
Vermont Transit -0.0410517 -63.0
Arkansas Transit 0.0365278 -27.0
Ohio Workplace -0.0352062 -35.0
Maine Grocery_Pharmacy -0.0341342 -13.0
Wisconsin Retail_Recreation 0.0325793 -44.0
South Dakota Retail_Recreation -0.0312457 -39.0
Florida Workplace -0.0291999 -33.0
Florida Transit -0.0290003 -49.0
Colorado Grocery_Pharmacy -0.0276385 -17.0
Arizona Workplace -0.0266351 -35.0
Missouri Retail_Recreation -0.0265452 -36.5
Montana Grocery_Pharmacy -0.0251868 -16.0
Tennessee Grocery_Pharmacy 0.0232998 6.0
Illinois Retail_Recreation 0.0222298 -40.0
New Mexico Workplace 0.0216216 -34.0
Colorado Retail_Recreation -0.0196212 -44.0
Georgia Transit -0.0195710 -35.0
Iowa Residential -0.0185413 13.0
Iowa Retail_Recreation 0.0179097 -38.0
Tennessee Transit 0.0175236 -32.0
Colorado Workplace 0.0164206 -39.0
West Virginia Retail_Recreation -0.0157126 -38.5
Arkansas Grocery_Pharmacy -0.0155896 3.0
Nebraska Parks 0.0153613 55.5
Kansas Residential -0.0148261 13.0
Oklahoma Retail_Recreation 0.0141809 -31.0
Oklahoma Transit 0.0137448 -26.0
Kansas Retail_Recreation -0.0095196 -38.0
Tennessee Retail_Recreation -0.0095117 -30.0
Missouri Grocery_Pharmacy 0.0086797 2.0
Maryland Parks 0.0065443 27.0
New Hampshire Transit -0.0053081 -57.0
South Carolina Retail_Recreation -0.0039092 -35.0
Washington Retail_Recreation -0.0036216 -42.0
South Dakota Grocery_Pharmacy 0.0034562 -9.0
Montana Retail_Recreation 0.0026129 -51.0
Nevada Grocery_Pharmacy 0.0008900 -12.5
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to GitHub. These are provided as a convenience, but note this is a manual process: the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
| name | plot_handle | link |
|---|---|---|
| World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png |
| World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png |
| World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png |
| World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png |
| City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png |
| City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png |
| City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png |
| City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png |
| City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png |
| State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png |
| State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png |

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (though some regions, such as MD, are still log-linear). The mortality rate (deaths/confirmed RNA-based cases) is >1%, with a range depending on region. Mobility is not a strong indicator of caseload (U.S. data).

See table below for detailed breakdown.

| Question | Answer |
|---|---|
| What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, either at the country (U.S.) or state level. California appears to have one of the best correlations, but the results are a mixed bag. |
| What is the trend in cases and mortality across geographical regions? | Confirmed total cases and mortality are overall log-linear for most countries, with a trailing off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to trail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region but is about 3% overall. |
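The mortality rate quoted above is the crude ratio of cumulative deaths to cumulative confirmed cases. A minimal sketch of that calculation in R, using illustrative numbers rather than values from the actual dataset:

```r
# Crude mortality rate = cumulative deaths / cumulative confirmed cases.
# The counts below are placeholders for illustration only.
cases <- data.frame(
  region          = c("NY", "MD", "World"),
  total_confirmed = c(250000, 30000, 5000000),
  total_deaths    = c(15000, 1400, 300000)
)
cases$mortality_rate <- cases$total_deaths / cases$total_confirmed
cases
```

Note that a rate computed this way is sensitive to testing volume, since the denominator only counts RNA-confirmed cases (see the assumptions above).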

END

End: ##—— Tue Jun 9 20:59:33 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
# install.packages("maps")  # run once if not already installed
library(maps)
library(dplyr)    # select(), filter()
library(ggplot2)
# Note: despite the name, this pulls Pennsylvania counties
mi_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
mi_counties$cases <- 1:2226  # placeholder case counts, one per map row
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map,c("region","group"))) %>% filter()

# Region list (variable names kept from an EU map example; only the US is used here)
some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as label coordinates for country names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date=="2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

inner_join(Corona_Cases.US_state, counties)  # dplyr join; sf provides an sf method

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Lat","Long")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_as_sf(spdf)  # convert the sp object to sf; st_cast() expects an sf/sfc input

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net